Overview

This notebook contains the analysis of Hypothesis 4, as preregistered at https://osf.io/vwg6u/:

The absolute distance between intermediate estimates and the final estimate of the rate of forgetting for a particular learner/fact combination during a learning session will converge on zero (i.e., fall within a specified boundary above zero) within fewer adjustments when the initial estimate of the rate of forgetting was based on the most predictive combination(s) of learner and/or fact history than when the initial estimate of the rate of forgetting was set at the default value of 0.3.

Setup

Load packages

library(BayesFactor)
library(dplyr)
library(forcats)
library(ggplot2)
library(purrr)
library(readr)
library(stringr)
library(tidyr)
library(tikzDevice)
library(lme4)
library(lmerTest)

Attaching package: ‘lmerTest’

The following object is masked from ‘package:lme4’:

    lmer

The following object is masked from ‘package:stats’:

    step
library(fst)
theme_set(theme_light(base_size = 14) +
  theme(strip.text = element_text(colour = "black")))
knitr::opts_chunk$set(fig.width=16, fig.height=16) 

Load data

responses_lab_2_with_model_params <- read.fst(file.path("..", "data", "processed", "lab", "session2", "session2_rl2_with_alpha_lab.fst"))
responses_mturk_2_with_model_params <- read.fst(file.path("..", "data", "processed", "mturk", "session2", "session2_rl2_with_alpha_mturk.fst"))
fix_condition_labels <- function(x) {
  rowwise(x) %>%
  mutate(condition = str_replace(condition, "-and-", " & ") %>%
           str_replace("student", "learner") %>%
           str_to_title()) %>%
  ungroup() %>%
  mutate(condition = fct_relevel(condition, "Fact & Learner", after = Inf)) # Move F&L level to the end
}
block2_with_model_params <- bind_rows(mutate(responses_lab_2_with_model_params, dataset = "Lab"),
                                      mutate(responses_mturk_2_with_model_params, dataset = "MTurk")) %>%
  fix_condition_labels()

Subset data

The preregistration specified 22 April 2019 as the cutoff date for data collection. We reached the minimum of 25 participants per condition in the lab sample before this date, but not in the Mechanical Turk sample. Data collection on MTurk therefore continued until there were at least 25 participants per condition, and was stopped on 14 May 2019. The analysis reported in the paper is based only on data collected before the cutoff date, but this notebook also reports the same analysis on the full dataset.

subjects_cutoff <- read_csv(file.path("..", "data", "processed", "subjects_before_cutoff.csv"))
Parsed with column specification:
cols(
  subject = col_character()
)
block_2_full <- block2_with_model_params 
block_2_cutoff <- block2_with_model_params %>%
  right_join(subjects_cutoff, by = "subject")

Convergence of the estimate

Each trial has a value alpha that expresses the estimated rate of forgetting of the fact at the start of the trial. The rate of forgetting estimate starts at the predicted value (which depends on the condition). It is then updated after each repetition of the fact to better match the observed responses.

alpha_change <- block_2_full %>%
  group_by(dataset, subject, condition, fact_id) %>%
  arrange(subject, condition, fact_id, repetition) %>%
  mutate(d_alpha = alpha - lag(alpha)) %>%
  mutate(d_alpha = ifelse(is.na(d_alpha), 0, d_alpha)) %>%
  mutate(abs_d_alpha = abs(d_alpha)) %>%
  mutate(final_alpha = tail(alpha,1)) %>%
  ungroup() %>%
  select(dataset, subject, condition, fact_id, repetition, alpha, d_alpha, abs_d_alpha, final_alpha)

The plot below visualises the development of each alpha estimate (every learner/fact combination is represented by a line). It shows several things:

  • The initial rate of forgetting estimate differs depending on the condition: in conditions with a domain-wide prediction (Default and Domain) the same value is always used, while in conditions with more individualised predictions (Fact and/or Learner) the starting value differs between facts and/or learners.
  • The rate of forgetting estimate is only changed after the third presentation of the fact. Up to that point it is always a horizontal line.
  • Because of the way the scheduling algorithm works, there is invariably a plume towards the top right: facts with a lower rate of forgetting are repeated less frequently than facts with a higher rate of forgetting.
ggplot(alpha_change, aes(x = repetition, y = alpha, group = interaction(subject, fact_id))) +
  facet_grid(dataset ~ condition, labeller = label_wrap_gen()) +
  geom_line(alpha = 0.1) +
  labs(x = "Presentation",
       y = "Rate of forgetting")

Our question is whether the model can find the correct rate of forgetting more quickly in conditions that use a predicted value as their starting point.

The plot below shows the change in the rate of forgetting estimate compared to the previous presentation. Note that changes are capped at 0.0496 in both directions. It appears that many of the changes are as large as possible.
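The cap can be illustrated with a minimal sketch (a hypothetical helper, not the actual model code), assuming each adjustment is simply clipped at ±0.0496:

```r
# Hypothetical helper illustrating the cap on rate-of-forgetting adjustments:
# any raw change larger than 0.0496 in either direction is clipped to the cap.
cap_adjustment <- function(raw_change, cap = 0.0496) {
  pmin(pmax(raw_change, -cap), cap)
}

cap_adjustment(c(-0.2, -0.01, 0.03, 0.15))
# -0.0496 -0.0100  0.0300  0.0496
```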

ggplot(alpha_change, aes(x = repetition, y = d_alpha, group = interaction(subject, fact_id))) +
  facet_grid(dataset ~ condition, labeller = label_wrap_gen()) +
  geom_line(alpha = 0.1) +
  labs(x = "Presentation",
       y = "Change in rate of forgetting")

The distribution of changes (shown below, excluding the first three presentations, in which no changes can occur) confirms that many changes are either as large as possible or almost zero. The dashed horizontal lines mark the boundaries of what we consider the convergence zone (0.00496 on either side of zero). This plot also gives an indication of the balance between upward and downward changes. Changes in the Default condition are clearly biased towards upward changes, which makes sense given the general difficulty of the material. The changes may be more balanced in the other conditions.

alpha_change %>%
  filter(repetition > 3) %>%
ggplot(aes(x = condition, y = d_alpha)) +
  facet_grid(dataset ~ ., labeller = label_wrap_gen()) +
  geom_jitter(aes(colour = condition), width = 0.1, height = 0, alpha = 0.1) +
  geom_violin(fill = NA) +
  geom_hline(yintercept = c(0.00496, -0.00496), lty = 2) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Change in rate of forgetting",
       caption = "Excluding the first 3 trials, in which RoF cannot change.")

alpha_change %>%
  filter(repetition > 3) %>%
ggplot(aes(x = d_alpha)) +
  facet_grid(dataset ~ condition, labeller = label_wrap_gen()) +
  geom_histogram(aes(fill = abs_d_alpha <= 0.00496), binwidth = 0.002) +
  labs(x = "Change in rate of forgetting",
       fill = "Within\nconvergence\nzone",
       caption = "Excluding the first 3 trials, in which RoF cannot change.")

We expected the number of presentations per fact to drop with better prediction, but if anything it seems to be higher in the predictive conditions:

n_reps <- block_2_full %>%
  group_by(dataset, subject, condition, fact_id) %>%
  summarise(repetitions = max(repetition))
ggplot(filter(n_reps, repetitions >= 4), aes(x = condition, y = repetitions)) +
  facet_grid(dataset ~ .) +
  geom_violin() +
  geom_jitter(aes(colour = condition), width = 0.05, height = 0, alpha = 0.5) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Repetitions",
       caption = "Excluding facts with fewer than 4 repetitions")

This might simply be an effect of the high difficulty of items overall: in most cases, if the rate of forgetting is estimated more accurately from the start, the item should be repeated more often since it is correctly assessed as more difficult. We would expect that for the easiest facts (of which there are only a few), prediction does decrease repetitions compared to the default condition.

As a sanity check, verify that the number of repetitions scales with the final alpha estimate:

n_reps_by_alpha <- block_2_full %>%
  group_by(dataset, subject, condition, fact_id) %>%
  summarise(repetitions = max(repetition),
            alpha = alpha[which.max(repetition)])
ggplot(filter(n_reps_by_alpha, repetitions >= 4), aes(x = alpha, y = repetitions, colour = condition)) +
  facet_grid(dataset ~ condition, labeller = label_wrap_gen()) +
  geom_jitter(alpha = 0.5) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE)

The plot below shows only the number of repetitions for cases in which the final alpha is lower than 0.3, and hints at an improvement (particularly in the MTurk data). Starting off with a lower-frequency repetition schedule indeed seems to lead to fewer repetitions overall.

n_reps_by_alpha <- block_2_full %>%
  group_by(dataset, subject, condition, fact_id) %>%
  summarise(repetitions = max(repetition),
            alpha = alpha[which.max(repetition)])
ggplot(filter(n_reps_by_alpha, repetitions >= 4, alpha < 0.3), aes(x = condition, y = repetitions, colour = condition)) +
  facet_grid(dataset ~ .) +
  geom_violin() +
  geom_jitter(width = 0.05, height = 0, alpha = 0.5) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Repetitions",
       caption = "Only showing facts with alpha < 0.3; excluding facts with fewer than 4 repetitions")

Indeed, an exploratory Bayesian ANOVA shows strong evidence for an effect of condition on the number of repetitions among low-alpha items (but no effect of dataset):

n_reps_by_alpha_low <- n_reps_by_alpha %>%
  ungroup() %>%
  mutate_if(is_character, as.factor) %>%
  filter(repetitions >= 4, alpha < 0.3)
anovaBF(
  repetitions ~ condition * dataset,
  data = n_reps_by_alpha_low,
  progress = FALSE
)
data coerced from tibble to data frame
Bayes factor analysis
--------------
[1] dataset                                 : 0.5233318 ±0%
[2] condition                               : 142.7986  ±0%
[3] dataset + condition                     : 35.61571  ±1.36%
[4] dataset + condition + dataset:condition : 83738.58  ±1.5%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS

Follow-up t-tests indicate that adaptation in general has a beneficial effect...

ttestBF(filter(n_reps_by_alpha_low, condition == "Default")$repetitions, filter(n_reps_by_alpha_low, condition != "Default")$repetitions)
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 159.0542 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

... but that there is only anecdotal evidence supporting a more individualised adaptation over a domain-wide adaptation:

ttestBF(filter(n_reps_by_alpha_low, condition == "Domain")$repetitions, filter(n_reps_by_alpha_low, condition %in% c("Fact & Learner", "Fact", "Learner"))$repetitions)
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 2.047736 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS

If we plot the cumulative percentage of converged estimates, we see that estimates in the default condition are more likely to converge than those in other conditions---not what we expected.
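The criterion can be illustrated on a toy adjustment trajectory (made-up values): the convergence point is one repetition after the last adjustment that falls outside the ±0.00496 boundary.

```r
# Toy trajectory of changes in the rate-of-forgetting estimate (made-up values).
d_alpha <- c(0, 0, 0, 0.0496, -0.02, 0.006, 0.003, -0.001, 0.002)
repetition <- seq_along(d_alpha)

# An estimate has converged once all remaining adjustments stay within the zone.
outside_zone <- abs(d_alpha) > 0.00496
convergence_point <- max(repetition[outside_zone]) + 1
convergence_point
# 7: from the 7th repetition onward, every change is within the convergence zone
```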

# convergence <- alpha_change %>%
#   group_by(dataset, subject, condition, fact_id) %>%
#   filter(abs_d_alpha > 0.00496) %>% # Keep cases in which the adjustment is outside the window
#   summarise(convergence_point = max(repetition) + 1) %>%
#   ungroup() %>%
#   mutate(dataset = as.factor(dataset),
#          subject = as.factor(subject),
#          condition = as.factor(condition),
#          fact_id = as.factor(fact_id))
# convergence <- alpha_change %>%
#   group_by(dataset, subject, condition, fact_id) %>%
#   mutate(within_boundary = abs_d_alpha <= 0.00496) %>%
#   mutate(total_reps = max(repetition)) %>%
#   filter(!within_boundary) %>%
#   summarise(total_reps = total_reps[1], convergence_point = max(repetition) + 1, final_alpha = alpha[which.max(repetition)]) %>%
#   mutate(convergence_point = ifelse(convergence_point > total_reps, NA, convergence_point))
convergence <- alpha_change %>%
  group_by(dataset, subject, condition, fact_id) %>%
  filter(max(repetition) > 3) %>%
  mutate(within_boundary = abs_d_alpha <= 0.00496) %>%
  mutate(total_reps = max(repetition)) %>%
  summarise(total_reps = total_reps[1],
            convergence_point = last(repetition[!within_boundary]) + 1) %>%
  mutate(convergence_point = ifelse(convergence_point > total_reps, NA, convergence_point))
convergence_cumulative <- convergence %>%
  group_by(dataset, condition) %>%
  count(convergence_point) %>%
  complete(nesting(dataset, condition), convergence_point = c(0:max(convergence$convergence_point, na.rm = T), NA), fill = list(n = 0)) %>%
  mutate(perc_converged = n / sum(n)) %>%
  mutate(perc_converged = cumsum(perc_converged)) %>%
  ungroup() %>%
  arrange(dataset, condition, convergence_point) %>%
  fill(perc_converged)
  
convergence_cumulative_combined <- convergence %>%
  group_by(condition) %>%
  count(convergence_point) %>%
  complete(nesting(condition), convergence_point = c(0:max(convergence$convergence_point, na.rm = T), NA), fill = list(n = 0)) %>%
  mutate(perc_converged = n / sum(n)) %>%
  mutate(perc_converged = cumsum(perc_converged)) %>%
  ungroup() %>%
  arrange(condition, convergence_point) %>%
  fill(perc_converged)
convergence_cumulative_combined %>%
  filter(!is.na(convergence_point)) %>%
  ggplot(aes(x = convergence_point, y = perc_converged, colour = condition, lty = condition)) +
  # facet_wrap(~ dataset) +
  geom_line(size = 1) +
  scale_colour_brewer(type = "qual", palette = 7) +
  scale_y_continuous(labels = scales::percent_format()) +
  expand_limits(x = 0, y = 0) +
  labs(x = "Repetition",
       y = "Estimates converged",
       colour = "Condition",
       lty = "Condition")

ggsave("../output/cum_perc_converged.pdf", device = "pdf", width = 5, height = 3)

Make the plot shown in the paper:

convergence_cumulative_combined_tex <- convergence_cumulative_combined
levels(convergence_cumulative_combined_tex$condition)[5] <- "Fact \\& Learner"
tikz(file = "../output/cum_perc_converged.tex", width = 4.75, height = 2)
convergence_cumulative_combined_tex %>%
  filter(!is.na(convergence_point)) %>%
  ggplot(aes(x = convergence_point, y = perc_converged, colour = condition, lty = condition)) +
  geom_line(size = 1) +
  scale_colour_brewer(type = "qual", palette = 7) +
  scale_y_continuous(labels = scales::percent_format(suffix = "\\%")) +
  expand_limits(x = 0, y = 0) +
  labs(x = "Repetition",
       y = "Estimates converged",
       colour = "Condition",
       lty = "Condition") +
  theme_light()
dev.off()
null device 
          1 

What proportion of estimates converged in each condition?

convergence_cumulative %>%
  group_by(dataset, condition) %>%
  filter(!is.na(convergence_point)) %>%
  summarise(n_converged = sum(n),
            perc_converged = max(perc_converged))

For the estimates that did converge, plot the distribution of convergence points by condition:

convergence %>%
  filter(!is.na(convergence_point)) %>%
  ggplot(aes(x = condition, y = convergence_point)) +
  # facet_grid(dataset ~ ., labeller = label_wrap_gen()) +
  geom_jitter(aes(colour = condition), width = 0.1, height = 0, alpha = 0.1) +
  geom_violin(fill = NA) +
  expand_limits(y = 0) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Convergence point")
ggsave("../output/convergence_point.pdf", device = "pdf", width = 5, height = 3)

Make the plot shown in the paper:

convergence_tex <- convergence
levels(convergence_tex$condition)[5] <- "Fact \\& Learner"
tikz(file = "../output/convergence_point.tex", width = 4.75, height = 2)
convergence_tex %>%
  filter(!is.na(convergence_point)) %>%
  ggplot(aes(x = condition, y = convergence_point)) +
  # facet_grid(dataset ~ ., labeller = label_wrap_gen()) +
  geom_jitter(aes(colour = condition), width = 0.1, height = 0, alpha = 0.1) +
  geom_violin(fill = NA) +
  expand_limits(y = 0) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Convergence point") +
  theme_light()
dev.off()
null device 
          1 

As preregistered, we conduct a Bayesian ANOVA testing the effect of condition on the convergence point.

There is very strong evidence for this model compared to a null model.

convergence_dat <- convergence %>%
  ungroup() %>%
  filter(!is.na(convergence_point)) %>%
  mutate_if(is.character, as.factor) %>%
  mutate(fact_id = as.factor(fact_id))
bf_convergence <- lmBF(
  formula = convergence_point ~ condition * dataset + subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = convergence_dat,
  progress = FALSE
)
data coerced from tibble to data frame
  
bf_convergence  
Bayes factor analysis
--------------
[1] condition * dataset + subject + fact_id : 594003.3 ±1.41%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS
1/bf_convergence
Bayes factor analysis
--------------
[1] Intercept only : 1.683492e-06 ±1.41%

Against denominator:
  convergence_point ~ condition * dataset + subject + fact_id 
---
Bayes factor type: BFlinearModel, JZS

To check whether we should test the two datasets separately, compare the first model to one without the interaction between condition and dataset. This comparison shows that there is strong evidence for the model without the interaction and against the model with the interaction.

bf_convergence_nointeraction <- lmBF(
  formula = convergence_point ~ condition + dataset + subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = convergence_dat,
  progress = FALSE
)
data coerced from tibble to data frame
  
bf_convergence_nointeraction / bf_convergence
Bayes factor analysis
--------------
[1] condition + dataset + subject + fact_id : 23.70967 ±2.05%

Against denominator:
  convergence_point ~ condition * dataset + subject + fact_id 
---
Bayes factor type: BFlinearModel, JZS

A model that leaves out dataset altogether is supported even more strongly by the evidence, showing that we can assume no effect of dataset on the convergence point.

bf_convergence_nodataset <- lmBF(
  formula = convergence_point ~ condition + subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = convergence_dat,
  progress = FALSE
)
data coerced from tibble to data frame
bf_convergence_nodataset / bf_convergence_nointeraction
Bayes factor analysis
--------------
[1] condition + subject + fact_id : 10.27426 ±1.57%

Against denominator:
  convergence_point ~ condition + dataset + subject + fact_id 
---
Bayes factor type: BFlinearModel, JZS

Compare to the maximal model:

bf_convergence_nodataset / bf_convergence
Bayes factor analysis
--------------
[1] condition + subject + fact_id : 243.5993 ±1.5%

Against denominator:
  convergence_point ~ condition * dataset + subject + fact_id 
---
Bayes factor type: BFlinearModel, JZS

Compare the preferred simpler model to a model without condition effect:

bf_convergence_nocondition <-  lmBF(
  formula = convergence_point ~ subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = convergence_dat,
  progress = FALSE
)
data coerced from tibble to data frame
bf_convergence_nocondition / bf_convergence_nodataset 
Bayes factor analysis
--------------
[1] subject + fact_id : 11.60565 ±0.59%

Against denominator:
  convergence_point ~ condition + subject + fact_id 
---
Bayes factor type: BFlinearModel, JZS
bf_convergence_nocondition / bf_convergence_nointeraction
Bayes factor analysis
--------------
[1] subject + fact_id : 119.2395 ±1.51%

Against denominator:
  convergence_point ~ condition + dataset + subject + fact_id 
---
Bayes factor type: BFlinearModel, JZS
bf_convergence_nocondition / bf_convergence
Bayes factor analysis
--------------
[1] subject + fact_id : 2827.129 ±1.43%

Against denominator:
  convergence_point ~ condition * dataset + subject + fact_id 
---
Bayes factor type: BFlinearModel, JZS

Although it's not the best model, we can sample from the posterior of the model with condition to see what it would predict:

samples <- posterior(bf_convergence_nodataset, iterations = 1000, progress = FALSE)
summary(samples[,1:6])

Iterations = 1:1000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 1000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                             Mean     SD Naive SE Time-series SE
mu                        8.19349 0.2730 0.008632       0.008070
condition-Default        -0.57755 0.2646 0.008367       0.008929
condition-Domain          0.04970 0.2697 0.008527       0.008527
condition-Fact            0.23992 0.2329 0.007366       0.007366
condition-Learner        -0.08596 0.2761 0.008732       0.008732
condition-Fact & Learner  0.37389 0.2815 0.008902       0.008902

2. Quantiles for each variable:

                            2.5%      25%      50%      75%    97.5%
mu                        7.6596  8.00411  8.20034  8.37465  8.71860
condition-Default        -1.0804 -0.75229 -0.59040 -0.39535 -0.03867
condition-Domain         -0.4836 -0.13903  0.04772  0.23367  0.57610
condition-Fact           -0.2158  0.08153  0.24698  0.40370  0.68763
condition-Learner        -0.6423 -0.26590 -0.07275  0.09449  0.43159
condition-Fact & Learner -0.1869  0.18210  0.38346  0.56493  0.92216

Exploratory: the effect of the RoF estimate on convergence

Convergence vs. no convergence

convergence_traces <- convergence %>%
  mutate(converges = !is.na(convergence_point)) %>%
  left_join(alpha_change, by = c("dataset", "subject", "condition", "fact_id"))

Trajectories

ggplot(convergence_traces, aes(x = repetition, y = alpha, colour = interaction(subject, fact_id))) +
  facet_grid(converges ~ condition) +
  geom_line(alpha = 0.1) +
  geom_point(alpha = 0.1) +
  guides(colour = FALSE) +
  labs(title = "Trajectories that do not converge (top) and those that do (bottom)",
       y = "Rate of forgetting")

ggplot(convergence_traces, aes(x = repetition, y = d_alpha, colour = interaction(subject, fact_id))) +
  facet_grid(converges ~ condition) +
  geom_line(alpha = 0.1) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = c(-0.00496, 0.00496), lty = 2) +
  guides(colour = FALSE) +
  labs(title = "Trajectories that do not converge (top) and those that do (bottom)",
       y = "Change in rate of forgetting")

Indexed on the final value:

convergence_traces %>%
  group_by(dataset, subject, condition, fact_id) %>%
  mutate(alpha_endaligned = alpha - final_alpha) %>%
  ggplot(aes(x = repetition, y = alpha_endaligned, colour = interaction(subject, fact_id))) +
  facet_grid(converges ~ condition) +
  geom_line(alpha = 0.1) +
  geom_point(alpha = 0.1) +
  guides(colour = FALSE) +
  labs(title = "Trajectories that do not converge (top) and those that do (bottom)",
       y = "Rate of forgetting (indexed on final value)")

Convergence point ~ number of repetitions

The convergence point is by definition capped at the total number of repetitions a fact receives (at the latest, we can observe convergence at the final repetition). It is also clear that convergence rarely happens much before the final repetition, since that would require more than two consecutive small changes. Pearson's r: 0.9789794.

ggplot(filter(convergence, !is.na(convergence_point)), aes(x = total_reps, y = convergence_point)) +
  geom_jitter(alpha = 0.2) +
  coord_equal() +
  geom_abline(slope = 1, intercept = 0, lty = 2)

Do changes in RoF become smaller as the number of repetitions increases? As the plot below shows: yes, but up to a point. After about 7 repetitions the size of the change stays more or less constant.

convergence_traces %>%
  group_by(repetition) %>%
  summarise(mean_change = mean(abs_d_alpha),
            se_change = plotrix::std.error(abs_d_alpha)) %>%
  ggplot(aes(x = repetition, y = mean_change)) +
  geom_point() +
  geom_errorbar(aes(ymin= mean_change - se_change, ymax = mean_change + se_change)) +
  labs(x = "Repetition",
       y = "Mean change in RoF estimate",
       caption = "Error bars show +/- 1 standard error of the mean")

Convergence point ~ final RoF

The convergence point seems to depend on the final rate of forgetting: the higher the rate of forgetting, the later convergence happens. Importantly, this trend seems to exist in all conditions.

quantiles <- quantile(filter(distinct(convergence_traces, dataset, subject, condition, fact_id, converges, final_alpha), converges)$final_alpha,
                      probs = seq(0, 1, 0.2))
convergence_traces %>%
  filter(converges) %>%
  ungroup() %>%
  mutate(final_alpha_bin = cut(final_alpha, quantiles, include.lowest = TRUE)) %>%
  group_by(dataset, subject, condition, fact_id, final_alpha) %>%
  slice(n()) %>%
  ggplot(aes(x = final_alpha_bin, y = convergence_point)) +
  facet_grid(~ condition) +
  geom_jitter(aes(colour = final_alpha_bin), width = 0.2, alpha = 0.2) +
  geom_violin(fill = NA) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  guides(colour = FALSE) +
  labs(x = "Final rate of forgetting (binned in quantiles)",
       y = "Convergence point") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

convergence_final_rof <- distinct(convergence_traces, dataset, subject, condition, fact_id, final_alpha, converges, convergence_point) %>%
  ungroup() %>%
  filter(converges) %>%
  mutate(dataset = as.factor(dataset),
         subject = as.factor(subject),
         fact_id = as.factor(fact_id))
m_conv_final_rof <- lmer(convergence_point ~ final_alpha * condition  + (1 | subject) + (1 | fact_id), data = convergence_final_rof)
summary(m_conv_final_rof)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: convergence_point ~ final_alpha * condition + (1 | subject) +      (1 | fact_id)
   Data: convergence_final_rof

REML criterion at convergence: 3970.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.1512 -0.5529  0.0085  0.5986  6.1152 

Random effects:
 Groups   Name        Variance Std.Dev.
 subject  (Intercept) 0.05296  0.2301  
 fact_id  (Intercept) 0.01241  0.1114  
 Residual             3.68901  1.9207  
Number of obs: 958, groups:  subject, 284; fact_id, 30

Fixed effects:
                                    Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)                          -8.2510     0.8499 910.8487  -9.708   <2e-16 ***
final_alpha                          38.0764     1.9958 935.2891  19.078   <2e-16 ***
conditionDomain                      -0.4480     1.2798 934.7219  -0.350    0.726    
conditionFact                        -0.7385     1.0884 908.8547  -0.678    0.498    
conditionLearner                     -0.3575     1.1759 889.9558  -0.304    0.761    
conditionFact & Learner              -1.4698     1.2512 881.5932  -1.175    0.240    
final_alpha:conditionDomain           1.3849     2.9387 945.7409   0.471    0.638    
final_alpha:conditionFact             1.9846     2.5038 931.8435   0.793    0.428    
final_alpha:conditionLearner          0.2059     2.6888 920.9964   0.077    0.939    
final_alpha:conditionFact & Learner   3.3828     2.8570 922.8715   1.184    0.237    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) fnl_lp cndtnD cndtnF cndtnL cndF&L fnl_:D fnl_:F fnl_:L
final_alpha -0.987                                                        
conditinDmn -0.663  0.655                                                 
conditinFct -0.779  0.769  0.518                                          
conditnLrnr -0.722  0.713  0.480  0.563                                   
cndtnFct&Lr -0.678  0.669  0.451  0.529  0.490                            
fnl_lph:cnD  0.670 -0.679 -0.987 -0.523 -0.484 -0.455                     
fnl_lph:cnF  0.785 -0.796 -0.522 -0.985 -0.568 -0.534  0.541              
fnl_lph:cnL  0.732 -0.741 -0.486 -0.571 -0.984 -0.497  0.504  0.591       
fnl_lph:F&L  0.688 -0.697 -0.458 -0.537 -0.498 -0.986  0.474  0.556  0.518

The lmer confirms a large, positive main effect of final rate of forgetting on convergence point, but no effect of, or interaction with, condition.

This implies that, regardless of the condition, convergence will happen sooner if the final estimate is lower.

Compare the distribution of the final rate of forgetting for estimates that did converge to that of estimates that did not.

convergence_traces %>%
  distinct(dataset, subject, condition, fact_id, .keep_all = TRUE) %>%
  ggplot(aes(x = converges, y = final_alpha)) +
  facet_grid(~ condition) +
  geom_jitter(aes(colour = converges), width = 0.2, alpha = 0.2) +
  geom_violin(fill = NA) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  guides(colour = FALSE)

conv <- convergence_traces %>%
  distinct(dataset, subject, condition, fact_id, .keep_all = TRUE)
m_conv_final <- lmer(converges ~ final_alpha * condition + (1 | subject) + (1 | fact_id), data = conv)
summary(m_conv_final)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: converges ~ final_alpha * condition + (1 | subject) + (1 | fact_id)
   Data: conv

REML criterion at convergence: 4201

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.2839 -0.6142 -0.3359 -0.0556  2.4759 

Random effects:
 Groups   Name        Variance  Std.Dev.
 subject  (Intercept) 0.0009191 0.03032 
 fact_id  (Intercept) 0.0002195 0.01482 
 Residual             0.1511051 0.38872 
Number of obs: 4364, groups:  subject, 291; fact_id, 30

Fixed effects:
                                      Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)                           -0.47314    0.06319 1455.58226  -7.488 1.21e-13 ***
final_alpha                            2.05334    0.16705 2217.00175  12.292  < 2e-16 ***
conditionDomain                        0.08191    0.09391 1893.61332   0.872   0.3832    
conditionFact                          0.07994    0.08209 2012.05410   0.974   0.3303    
conditionLearner                       0.10055    0.09120 1919.40134   1.102   0.2704    
conditionFact & Learner                0.08513    0.09204 2116.49983   0.925   0.3551    
final_alpha:conditionDomain           -0.52471    0.24141 2755.45593  -2.174   0.0298 *  
final_alpha:conditionFact             -0.52171    0.21251 2865.73697  -2.455   0.0141 *  
final_alpha:conditionLearner          -0.55949    0.23374 2716.39463  -2.394   0.0167 *  
final_alpha:conditionFact & Learner   -0.52013    0.23608 2983.60844  -2.203   0.0277 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) fnl_lp cndtnD cndtnF cndtnL cndF&L fnl_:D fnl_:F fnl_:L
final_alpha -0.971                                                        
conditinDmn -0.669  0.651                                                 
conditinFct -0.765  0.744  0.515                                          
conditnLrnr -0.689  0.670  0.464  0.530                                   
cndtnFct&Lr -0.682  0.664  0.460  0.526  0.473                            
fnl_lph:cnD  0.670 -0.689 -0.975 -0.516 -0.464 -0.460                     
fnl_lph:cnF  0.760 -0.782 -0.512 -0.973 -0.527 -0.522  0.542              
fnl_lph:cnL  0.691 -0.712 -0.465 -0.532 -0.973 -0.475  0.493  0.559       
fnl_lph:F&L  0.684 -0.704 -0.461 -0.527 -0.474 -0.974  0.488  0.554  0.504

Takeaway:

  • Estimates that do converge tend to have a higher final value than estimates that don't (which makes some sense, since there are more opportunities for adjustment/stabilising).
  • For estimates that converge, convergence happens sooner when the final value is lower.
  • Tradeoff: lower final value means less likely to converge, but when convergence does happen, it's faster.

Convergence point ~ initial RoF

Is the above also true for the initial rate of forgetting estimate? Maybe, judging by the plot below (note that the quantiles are recalculated based on initial alpha values).

quantiles <- quantile(filter(convergence_traces, converges, repetition == 1)$alpha,
                      probs = seq(0, 1, 0.2))
convergence_traces %>%
  filter(converges, repetition == 1) %>%
  ungroup() %>%
  mutate(alpha_bin = cut(alpha, quantiles, include.lowest = TRUE)) %>%
  ggplot(aes(x = alpha_bin, y = convergence_point)) +
  facet_grid(~ condition) +
  geom_jitter(aes(colour = alpha_bin), width = 0.2, alpha = 0.2) +
  geom_violin(fill = NA) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  guides(colour = FALSE) +
  labs(x = "Initial rate of forgetting (binned in quantiles)",
       y = "Convergence point") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

convergence_init_rof <- convergence_traces %>%
  ungroup() %>%
  filter(converges, repetition == 1) %>%
  mutate(dataset = as.factor(dataset),
         subject = as.factor(subject),
         fact_id = as.factor(fact_id))
m_conv_init_rof <- lmer(convergence_point ~ alpha * condition  + (1 | subject) + (1 | fact_id), data = convergence_init_rof)
fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
summary(m_conv_init_rof)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: convergence_point ~ alpha * condition + (1 | subject) + (1 |      fact_id)
   Data: convergence_init_rof

REML criterion at convergence: 5011

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8958 -0.5854 -0.2257  0.3183  8.3311 

Random effects:
 Groups   Name        Variance Std.Dev.
 subject  (Intercept)  1.0468  1.0231  
 fact_id  (Intercept)  0.4639  0.6811  
 Residual             10.1190  3.1810  
Number of obs: 958, groups:  subject, 284; fact_id, 30

Fixed effects:
                        Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)             -10.3766     3.3206 704.3402  -3.125  0.00185 ** 
alpha                    60.1350    11.0435 709.2184   5.445 7.14e-08 ***
conditionDomain          14.2744     9.9482 209.6585   1.435  0.15282    
conditionFact            10.6020     3.9116 809.2985   2.710  0.00686 ** 
conditionLearner          3.5360     4.4931 405.1176   0.787  0.43175    
conditionFact & Learner  -2.2479     0.7197 387.1621  -3.123  0.00192 ** 
alpha:conditionDomain   -47.2044    28.9796 220.5039  -1.629  0.10477    
alpha:conditionFact     -37.2203    12.2891 803.9285  -3.029  0.00253 ** 
alpha:conditionLearner  -18.4022    13.8335 460.6365  -1.330  0.18409    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) alpha  cndtnD cndtnF cndtnL cndF&L alph:D alph:F
alpha       -0.996                                                 
conditinDmn -0.328  0.328                                          
conditinFct -0.800  0.797  0.266                                   
conditnLrnr -0.744  0.742  0.245  0.592                            
cndtnFct&Lr  0.798 -0.831 -0.265 -0.635 -0.594                     
alph:cndtnD  0.374 -0.376 -0.998 -0.302 -0.279  0.315              
alph:cndtnF  0.852 -0.854 -0.283 -0.991 -0.631  0.709  0.324       
alph:cndtnL  0.799 -0.802 -0.263 -0.637 -0.992  0.666  0.301  0.683
fit warnings:
fixed-effect model matrix is rank deficient so dropping 1 column / coefficient

The lmer confirms that the convergence point increases as a function of the initial rate of forgetting (alpha) in all conditions, even accounting for the interaction between the initial alpha and the condition.

Convergence seems to be about equally likely for low and high initial values.

convergence_traces %>%
  distinct(dataset, subject, condition, fact_id, .keep_all = TRUE) %>%
  ggplot(aes(x = converges, y = alpha)) +
  facet_grid(~ condition) +
  geom_jitter(aes(colour = converges), width = 0.2, alpha = 0.2) +
  geom_violin(fill = NA) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  guides(colour = FALSE)

The lack of an effect is confirmed by an lmer:

m_conv_init <- lmer(converges ~ alpha * condition + (1 | subject) + (1 | fact_id), data = conv)
fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
summary(m_conv_init)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: converges ~ alpha * condition + (1 | subject) + (1 | fact_id)
   Data: conv

REML criterion at convergence: 4644.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.1161 -0.5731 -0.4636 -0.2834  2.2070 

Random effects:
 Groups   Name        Variance Std.Dev.
 subject  (Intercept) 0.003377 0.05811 
 fact_id  (Intercept) 0.003009 0.05486 
 Residual             0.164596 0.40570 
Number of obs: 4364, groups:  subject, 291; fact_id, 30

Fixed effects:
                          Estimate Std. Error         df t value Pr(>|t|)  
(Intercept)                0.39307    0.19291 1541.79816   2.038   0.0418 *
alpha                     -0.37481    0.63989 1589.07872  -0.586   0.5581  
conditionDomain            0.27153    0.56167  299.59644   0.483   0.6291  
conditionFact             -0.28802    0.22118 2636.45689  -1.302   0.1930  
conditionLearner          -0.39459    0.25428  683.42709  -1.552   0.1212  
conditionFact & Learner   -0.05310    0.04077  661.14912  -1.303   0.1932  
alpha:conditionDomain     -0.93541    1.63596  315.01413  -0.572   0.5679  
alpha:conditionFact        0.63970    0.69997 2480.83176   0.914   0.3609  
alpha:conditionLearner     0.97921    0.78805  820.90581   1.243   0.2144  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) alpha  cndtnD cndtnF cndtnL cndF&L alph:D alph:F
alpha       -0.995                                                 
conditinDmn -0.341  0.340                                          
conditinFct -0.800  0.797  0.274                                   
conditnLrnr -0.757  0.756  0.259  0.607                            
cndtnFct&Lr  0.769 -0.810 -0.263 -0.615 -0.584                     
alph:cndtnD  0.388 -0.390 -0.998 -0.311 -0.295  0.316              
alph:cndtnF  0.847 -0.851 -0.290 -0.991 -0.643  0.691  0.332       
alph:cndtnL  0.808 -0.813 -0.277 -0.647 -0.992  0.658  0.317  0.691
fit warnings:
fixed-effect model matrix is rank deficient so dropping 1 column / coefficient

We can surmise that the convergence point is sensitive to both the initial estimate and the final estimate. When conditions have unequal initial estimates, as is the case here (the initial estimates in the adaptive conditions tend to be higher than those in the default condition), this confounds the results.

It is clear that the operationalisation of the convergence point is problematic: one would come to the conclusion that a lower initial estimate is always better, simply because that is rewarded in the current analysis. We may want to think about a measure that is robust to this.

Ideas:

  • Split the data on initial estimate (lower than 0.3 or higher than 0.3) and see if convergence improves below 0.3.
  • ...
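
The first idea could be sketched as follows (exploratory; this reuses the `convergence_traces` columns from above and treats the default value of 0.3 as the split point):

```r
# Proportion of estimates that converge, split on whether the initial
# alpha was below the default of 0.3.
convergence_traces %>%
  filter(repetition == 1) %>%
  ungroup() %>%
  mutate(init_bin = ifelse(alpha < 0.3,
                           "initial alpha < 0.3",
                           "initial alpha >= 0.3")) %>%
  group_by(condition, init_bin) %>%
  summarise(prop_converged = mean(converges),
            n = n())
```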

The lower convergence rate in the adaptive conditions may be due to the final alpha being higher in those conditions. There does seem to be a slightly lower final rate of forgetting in the Default condition than in the adaptive conditions, but this difference mostly disappears in pairwise comparisons (also when we only look at converged estimates), and it is not supported by a Bayesian lmBF either, which slightly prefers a model without condition.
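
The pairwise comparisons referred to here could be run along these lines (a hypothetical sketch; the exact tests may differ from those used elsewhere in the analysis):

```r
# Holm-corrected pairwise t-tests of the final rate of forgetting
# between conditions.
pairwise.t.test(conv$final_alpha, conv$condition, p.adjust.method = "holm")
```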

p_final_rof <- ggplot(conv, aes(x = condition, y = final_alpha)) +
  geom_violin() +
  geom_jitter(alpha = 0.1, aes(colour = condition)) +
  geom_boxplot(outlier.shape = NA, width = 0.5) +
  labs(x = "Condition",
       y = "Final rate of forgetting") +
  guides(colour = FALSE)
tikz(file = "../output/final_rof_by_cond.tex", width = 4, height = 2.5)
p_final_rof +
  scale_x_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()
null device 
          1 
p_final_rof

summary(m_final_rof_cond <- lmer(final_alpha ~ condition + (1 | subject) + (1 | fact_id), data = conv))
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: final_alpha ~ condition + (1 | subject) + (1 | fact_id)
   Data: conv

REML criterion at convergence: -10247.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.6263 -0.7031 -0.0942  0.5698  5.9523 

Random effects:
 Groups   Name        Variance Std.Dev.
 subject  (Intercept) 0.001176 0.03430 
 fact_id  (Intercept) 0.001135 0.03369 
 Residual             0.004902 0.07001 
Number of obs: 4364, groups:  subject, 291; fact_id, 30

Fixed effects:
                         Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)             3.734e-01  8.221e-03 8.159e+01  45.417  < 2e-16 ***
conditionDomain         2.164e-02  7.691e-03 2.677e+02   2.813  0.00527 ** 
conditionFact           1.604e-02  6.932e-03 2.675e+02   2.314  0.02141 *  
conditionLearner        2.352e-02  7.654e-03 2.681e+02   3.072  0.00234 ** 
conditionFact & Learner 2.009e-02  7.724e-03 2.668e+02   2.601  0.00981 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) cndtnD cndtnF cndtnL
conditinDmn -0.470                     
conditinFct -0.522  0.558              
conditnLrnr -0.473  0.505  0.561       
cndtnFct&Lr -0.468  0.500  0.556  0.503
bf_final_rof_cond <- lmBF(
  formula = final_alpha ~ condition + subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = mutate(ungroup(conv), subject = as.factor(subject), fact_id = as.factor(fact_id)),
  progress = FALSE
)
data coerced from tibble to data frame
  
bf_final_rof_nocond <- lmBF(
  formula = final_alpha ~ subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = mutate(ungroup(conv), subject = as.factor(subject), fact_id = as.factor(fact_id)),
  progress = FALSE
)
data coerced from tibble to data frame
  
bf_final_rof_cond / bf_final_rof_nocond 
Bayes factor analysis
--------------
[1] condition + subject + fact_id : 0.2735443 ±0.69%

Against denominator:
  final_alpha ~ subject + fact_id 
---
Bayes factor type: BFlinearModel, JZS

Tangential: how similar are final rates of forgetting for specific facts between conditions? Looks fairly consistent across conditions:

conv %>%
  group_by(condition, fact_id) %>%
  summarise(alpha = mean(final_alpha),
            n = n()) %>%
  # mutate(rank = dense_rank(desc(alpha))) %>%
  ggplot(aes(x = condition, y = alpha, group = fact_id, colour = as.factor(fact_id))) +
  geom_line(alpha = 0.75) +
  geom_point(aes(size = n)) +
  guides(colour = FALSE)


Make diagnostic plots:

library(patchwork)
# Change in rate of forgetting with each repetition
p_rof_change <- convergence_traces %>%
  group_by(repetition) %>%
  summarise(mean_change = mean(abs_d_alpha),
            se_change = plotrix::std.error(abs_d_alpha)) %>%
  ggplot(aes(x = repetition, y = mean_change)) +
  geom_point() +
  geom_errorbar(aes(ymin= mean_change - se_change, ymax = mean_change + se_change), na.rm = T) +
  geom_hline(yintercept = 0.00496, colour = "red", lty = 2) +
  labs(x = "Repetition",
       y = "Absolute rate of forgetting change")
  
# Convergence point ~ total number of repetitions
p_conv_reps <- ggplot(filter(convergence, !is.na(convergence_point)), aes(x = total_reps, y = convergence_point)) +
  geom_jitter(alpha = 0.2) +
  coord_equal() +
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  labs(x = "Total repetitions",
      y = "Convergence point")
p_conv_point_initial <- ggplot(conv, aes(x = alpha, y = convergence_point, colour = condition, fill = condition)) +
  facet_grid(~ dataset) +
  geom_point(alpha = 0.1, na.rm = T) +
  geom_smooth(method = "lm", se = T, na.rm = T, size = .75) +
  # geom_rug(aes(y = NULL), sides = "b") +
  labs(x = "Initial rate of forgetting",
         y = "Convergence point")
p_conv_point_final <- ggplot(conv, aes(x = final_alpha, y = convergence_point, colour = condition, fill = condition)) +
  facet_grid(~ dataset) +
  geom_point(alpha = 0.1, na.rm = T) +
  geom_smooth(method = "lm", se = T, na.rm = T, size = .75) +
  # geom_rug(aes(y = NULL), sides = "b") +
  labs(x = "Final rate of forgetting",
         y = "Convergence point")
p_conv_prob_initial <- ggplot(conv, aes(x = alpha, y = as.numeric(converges), colour = condition, fill = condition)) +
  facet_grid(~ dataset) +
  geom_point(alpha = 0.1, na.rm = T) +
  geom_smooth(method = "glm",  method.args=list(family="binomial"), se = T, na.rm = T) +
  # geom_rug(aes(y = NULL), sides = "b") +
  ylim(0, 1) +
  labs(x = "Initial rate of forgetting",
       y = "Convergence probability")
p_conv_prob_final <- ggplot(conv, aes(x = final_alpha, y = as.numeric(converges), colour = condition, fill = condition)) +
  facet_grid(~ dataset) +
  geom_point(alpha = 0.1, na.rm = T) +
  geom_smooth(method = "glm",  method.args=list(family="binomial"), se = T, na.rm = T) +
  # geom_rug(aes(y = NULL), sides = "b") +
  ylim(0, 1) +
  labs(x = "Final rate of forgetting",
       y = "Convergence probability")
p_rof_change + p_conv_reps + p_conv_point_initial + p_conv_prob_initial + p_conv_point_final + p_conv_prob_final + plot_layout(ncol = 2, byrow = TRUE)
ggsave("../output/convergence_diagnostics.pdf", width = 12, height = 12)

Plot the RoF change plot for each condition separately:

p_rof_change_cond <- convergence_traces %>%
  group_by(condition, repetition) %>%
  summarise(mean_change = mean(abs_d_alpha),
            se_change = plotrix::std.error(abs_d_alpha)) %>%
  ggplot(aes(x = repetition, y = mean_change, colour = condition)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin= mean_change - se_change, ymax = mean_change + se_change), na.rm = T) +
  geom_hline(yintercept = 0.00496, colour = "red", lty = 2) +
  labs(x = "Repetition",
       y = "Absolute rate of forgetting change") +
  xlim(0,10) 
p_rof_change_cond

Also plot each diagnostic plot separately:

tikz(file = "../output/rof_change.tex", width = 4, height = 2.5)
p_rof_change +
  theme_light()
dev.off()
null device 
          1 
tikz(file = "../output/conv_reps.tex", width = 2.5, height = 2.5)
p_conv_reps +
  theme_light()
dev.off()
null device 
          1 
tikz(file = "../output/conv_point_initial_rof.tex", width = 4, height = 2)
p_conv_point_initial +
  scale_color_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  scale_fill_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()
null device 
          1 
tikz(file = "../output/conv_point_final_rof.tex", width = 4, height = 2)
p_conv_point_final +
  scale_color_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  scale_fill_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()
null device 
          1 
tikz(file = "../output/conv_prob_initial_rof.tex", width = 4, height = 2)
p_conv_prob_initial +
  scale_color_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  scale_fill_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()
null device 
          1 
tikz(file = "../output/conv_prob_final_rof.tex", width = 4, height = 2)
p_conv_prob_final +
  scale_color_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  scale_fill_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()
null device 
          1 

Session information

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_0.0.1        lmerTest_3.1-0         lme4_1.1-21            tikzDevice_0.12        forcats_0.4.0          fst_0.9.0              tidyr_0.8.3           
 [8] stringr_1.4.0          readr_1.3.1            purrr_0.3.2            ggplot2_3.1.1          dplyr_0.8.0.1          BayesFactor_0.9.12-4.2 Matrix_1.2-17         
[15] coda_0.19-2           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1          mvtnorm_1.0-10      lattice_0.20-38     gtools_3.8.1        assertthat_0.2.1    digest_0.6.18       R6_2.4.0            plyr_1.8.4         
 [9] MatrixModels_0.4-1  evaluate_0.13       pillar_1.4.0        rlang_0.3.4         lazyeval_0.2.2      rstudioapi_0.10     minqa_1.2.4         nloptr_1.2.1       
[17] rmarkdown_1.12      labeling_0.3        splines_3.6.1       munsell_0.5.0       compiler_3.6.1      numDeriv_2016.8-1.1 xfun_0.6            pkgconfig_2.0.2    
[25] base64enc_0.1-3     htmltools_0.3.6     tidyselect_0.2.5    tibble_2.1.1        crayon_1.3.4        withr_2.1.2         MASS_7.3-51.1       grid_3.6.1         
[33] nlme_3.1-140        jsonlite_1.6        gtable_0.3.0        magrittr_1.5        scales_1.0.0        stringi_1.4.3       pbapply_1.4-0       reshape2_1.4.3     
[41] boot_1.3-22         RColorBrewer_1.1-2  tools_3.6.1         glue_1.3.1          hms_0.4.2           plotrix_3.7-6       parallel_3.6.1      yaml_2.2.0         
[49] colorspace_1.4-1    filehash_2.4-2      knitr_1.22         
---
title: "Test of H4: The rate of forgetting estimate converges on its final value more quickly when its initial value was predicted using learning history"
author: "Maarten van der Velde"
date: "Last updated: `r Sys.time()`"
output:
  html_notebook:
    smart: no
    toc: yes
    toc_float: yes
  html_document:
    df_print: paged
    toc: yes
    toc_float: yes
---

# Overview

This notebook contains the analysis of Hypothesis 4, as preregistered at https://osf.io/vwg6u/:

> The absolute distance between intermediate estimates and the final estimate of the rate of forgetting for a particular learner/fact combination during a learning session will converge on zero (i.e., fall within a specified boundary above zero) within fewer adjustments when the initial estimate of the rate of forgetting was based on the most predictive combination(s) of learner and/or fact history than when the initial estimate of the rate of forgetting was set at the default value of 0.3.

# Setup

## Load packages
```{r}
library(BayesFactor)
library(dplyr)
library(forcats)
library(ggplot2)
library(purrr)
library(readr)
library(stringr)
library(tidyr)
library(tikzDevice)
library(lme4)
library(lmerTest)
library(fst)

theme_set(theme_light(base_size = 14) +
  theme(strip.text = element_text(colour = "black")))

knitr::opts_chunk$set(fig.width=16, fig.height=16) 
```

## Load data
```{r}
responses_lab_2_with_model_params <- read.fst(file.path("..", "data", "processed", "lab", "session2", "session2_rl2_with_alpha_lab.fst"))
responses_mturk_2_with_model_params <- read.fst(file.path("..", "data", "processed", "mturk", "session2", "session2_rl2_with_alpha_mturk.fst"))

fix_condition_labels <- function(x) {
  rowwise(x) %>%
  mutate(condition = str_replace(condition, "-and-", " & ") %>%
           str_replace("student", "learner") %>%
           str_to_title()) %>%
  ungroup() %>%
  mutate(condition = fct_relevel(condition, "Fact & Learner", after = Inf)) # Move F&L level to the end
}

block2_with_model_params <- bind_rows(mutate(responses_lab_2_with_model_params, dataset = "Lab"),
                                      mutate(responses_mturk_2_with_model_params, dataset = "MTurk")) %>%
  fix_condition_labels()
```

## Subset data
The preregistration specified 22 April 2019 as the cutoff date for data collection.
We reached the minimum of 25 participants per condition in the lab sample before this date, but not in the Mechanical Turk sample.
For that reason data collection on MTurk continued until there were at least 25 participants per condition (14 May 2019) and was stopped only then.
The analysis reported in the paper is based only on the data from before the cutoff date, but this notebook also reports the same analysis done on the full dataset. 

```{r}
subjects_cutoff <- read_csv(file.path("..", "data", "processed", "subjects_before_cutoff.csv"))

block_2_full <- block2_with_model_params 

block_2_cutoff <- block2_with_model_params %>%
  right_join(subjects_cutoff, by = "subject")
```


# Convergence of the estimate

Each trial has a value `alpha` that expresses the estimated rate of forgetting of the fact at the start of the trial.
The rate of forgetting estimate starts at the predicted value (which depends on the condition).
It is then updated after each repetition of the fact to better match the observed responses.

```{r}
alpha_change <- block_2_full %>%
  group_by(dataset, subject, condition, fact_id) %>%
  arrange(subject, condition, fact_id, repetition) %>%
  mutate(d_alpha = alpha - lag(alpha)) %>%
  mutate(d_alpha = ifelse(is.na(d_alpha), 0, d_alpha)) %>%
  mutate(abs_d_alpha = abs(d_alpha)) %>%
  mutate(final_alpha = tail(alpha,1)) %>%
  ungroup() %>%
  select(dataset, subject, condition, fact_id, repetition, alpha, d_alpha, abs_d_alpha, final_alpha)
```

The plot below visualises the development of each alpha estimate (every learner/fact combination is represented by a line).
It shows several things:

- The initial rate of forgetting estimate differs depending on the condition: in conditions with a domain-wide prediction (Default and Domain) the same value is always used, while in conditions with more individualised predictions (Fact and/or Learner) the starting value differs between facts and/or learners.
- The rate of forgetting estimate is only changed *after* the third presentation of the fact. Up to that point it is always a horizontal line.
- Because of the way the scheduling algorithm works, there is invariably a plume towards the top right: facts with a lower rate of forgetting are repeated less frequently than facts with a higher rate of forgetting.

```{r, fig.width=12, fig.height=6}
ggplot(alpha_change, aes(x = repetition, y = alpha, group = interaction(subject, fact_id))) +
  facet_grid(dataset ~ condition, labeller = label_wrap_gen()) +
  geom_line(alpha = 0.1) +
  labs(x = "Presentation",
       y = "Rate of forgetting")
```


Our question is whether the model can find the correct rate of forgetting more quickly in conditions that use a predicted value as their starting point.

The plot below shows the *change* in the rate of forgetting estimate compared to the previous presentation.
Note that changes are capped at 0.0496 in both directions.
It appears that many of the changes are as large as possible.

```{r, fig.width=12, fig.height=6}
ggplot(alpha_change, aes(x = repetition, y = d_alpha, group = interaction(subject, fact_id))) +
  facet_grid(dataset ~ condition, labeller = label_wrap_gen()) +
  geom_line(alpha = 0.1) +
  labs(x = "Presentation",
       y = "Change in rate of forgetting")
```
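
As a rough check on the claim that many changes are maximal (an exploratory sketch, assuming the cap of 0.0496 noted above):

```{r}
# Proportion of post-third-presentation changes at (or numerically very
# close to) the maximum possible adjustment of 0.0496.
alpha_change %>%
  filter(repetition > 3) %>%
  summarise(prop_at_cap = mean(abs_d_alpha >= 0.0496 - 1e-6))
```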


The distribution of changes (shown below, excluding the first three presentations since they can never have changes) confirms that many changes are either as large as possible, or almost zero.
The dashed horizontal lines show the boundaries of what we consider to be the convergence zone (0.00496 on either side of zero).
This plot also gives an indication of the balance between upward and downward changes.
It is especially apparent that changes in the Default condition are biased towards upward changes, which makes sense given the general difficulty of the material.
It looks like the changes may be more balanced in other conditions.

```{r, fig.width=12, fig.height=6}
alpha_change %>%
  filter(repetition > 3) %>%
ggplot(aes(x = condition, y = d_alpha)) +
  facet_grid(dataset ~ ., labeller = label_wrap_gen()) +
  geom_jitter(aes(colour = condition), width = 0.1, height = 0, alpha = 0.1) +
  geom_violin(fill = NA) +
  geom_hline(yintercept = c(0.00496, -0.00496), lty = 2) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Change in rate of forgetting",
       caption = "Excluding the first 3 trials, in which RoF cannot change.")
```
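
The balance between upward and downward changes can also be quantified directly (an exploratory sketch):

```{r}
# Of the changes outside the convergence zone, what proportion is
# upward, per condition?
alpha_change %>%
  filter(repetition > 3, abs_d_alpha > 0.00496) %>%
  group_by(condition) %>%
  summarise(prop_upward = mean(d_alpha > 0),
            n = n())
```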


```{r}
alpha_change %>%
  filter(repetition > 3) %>%
ggplot(aes(x = d_alpha)) +
  facet_grid(dataset ~ condition, labeller = label_wrap_gen()) +
  geom_histogram(aes(fill = abs_d_alpha <= 0.00496), binwidth = 0.002) +
  labs(x = "Change in rate of forgetting",
       fill = "Within\nconvergence\nzone",
       caption = "Excluding the first 3 trials, in which RoF cannot change.")

```




Contrary to our expectation that better prediction would reduce the number of presentations per fact, it seems, if anything, to be *higher* in the predictive conditions:

```{r fig.width=12}
n_reps <- block_2_full %>%
  group_by(dataset, subject, condition, fact_id) %>%
  summarise(repetitions = max(repetition))

ggplot(filter(n_reps, repetitions >= 4), aes(x = condition, y = repetitions)) +
  facet_grid(dataset ~ .) +
  geom_violin() +
  geom_jitter(aes(colour = condition), width = 0.05, height = 0, alpha = 0.5) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Repetitions",
       caption = "Excluding facts with fewer than 4 repetitions")
```

This might simply be an effect of the high difficulty of items overall: in most cases, if the rate of forgetting is estimated more accurately from the start, the item *should* be repeated more often since it is correctly assessed as more difficult.
We would expect that for the easiest facts (of which there are only a few), prediction does decrease repetitions compared to the default condition.

As a sanity check, verify that the number of repetitions scales with the final alpha estimate:
```{r fig.width=12}
n_reps_by_alpha <- block_2_full %>%
  group_by(dataset, subject, condition, fact_id) %>%
  summarise(repetitions = max(repetition),
            alpha = alpha[which.max(repetition)])

ggplot(filter(n_reps_by_alpha, repetitions >= 4), aes(x = alpha, y = repetitions, colour = condition)) +
  facet_grid(dataset ~ condition, labeller = label_wrap_gen()) +
  geom_jitter(alpha = 0.5) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE)
```


The plot below shows only the number of repetitions for cases in which the final alpha is lower than 0.3, and hints at an improvement (particularly in the MTurk data).
Starting off with a lower-frequency repetition schedule indeed seems to lead to fewer repetitions overall.

```{r}
n_reps_by_alpha <- block_2_full %>%
  group_by(dataset, subject, condition, fact_id) %>%
  summarise(repetitions = max(repetition),
            alpha = alpha[which.max(repetition)])

ggplot(filter(n_reps_by_alpha, repetitions >= 4, alpha < 0.3), aes(x = condition, y = repetitions, colour = condition)) +
  facet_grid(dataset ~ .) +
  geom_violin() +
  geom_jitter(width = 0.05, height = 0, alpha = 0.5) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Repetitions",
       caption = "Only showing facts with alpha < 0.3; excluding facts with fewer than 4 repetitions")
```

Indeed, an exploratory Bayesian ANOVA shows strong evidence for an effect of condition on the number of repetitions among low-alpha items (but no effect of dataset):
```{r}
n_reps_by_alpha_low <- n_reps_by_alpha %>%
  ungroup() %>%
  mutate_if(is_character, as.factor) %>%
  filter(repetitions >= 4, alpha < 0.3)

anovaBF(
  repetitions ~ condition * dataset,
  data = n_reps_by_alpha_low,
  progress = FALSE
)
```

Follow-up t-tests indicate that adaptation in general has a beneficial effect...
```{r}
ttestBF(filter(n_reps_by_alpha_low, condition == "Default")$repetitions, filter(n_reps_by_alpha_low, condition != "Default")$repetitions)
```

... but that there is only anecdotal evidence supporting a more individualised adaptation over a domain-wide adaptation:
```{r}
ttestBF(filter(n_reps_by_alpha_low, condition == "Domain")$repetitions, filter(n_reps_by_alpha_low, condition %in% c("Fact & Learner", "Fact", "Learner"))$repetitions)
```



If we plot the cumulative percentage of converged estimates, we see that estimates in the default condition are more likely to converge than those in other conditions---not what we expected.  


```{r fig.width=12}
convergence <- alpha_change %>%
  group_by(dataset, subject, condition, fact_id) %>%
  filter(max(repetition) > 3) %>%
  mutate(within_boundary = abs_d_alpha <= 0.00496) %>%
  mutate(total_reps = max(repetition)) %>%
  summarise(total_reps = total_reps[1],
            convergence_point = last(repetition[!within_boundary]) + 1) %>%
  mutate(convergence_point = ifelse(convergence_point > total_reps, NA, convergence_point))


convergence_cumulative <- convergence %>%
  group_by(dataset, condition) %>%
  count(convergence_point) %>%
  complete(nesting(dataset, condition), convergence_point = c(0:max(convergence$convergence_point, na.rm = T), NA), fill = list(n = 0)) %>%
  mutate(perc_converged = n / sum(n)) %>%
  mutate(perc_converged = cumsum(perc_converged)) %>%
  ungroup() %>%
  arrange(dataset, condition, convergence_point) %>%
  fill(perc_converged)
  

convergence_cumulative_combined <- convergence %>%
  group_by(condition) %>%
  count(convergence_point) %>%
  complete(nesting(condition), convergence_point = c(0:max(convergence$convergence_point, na.rm = T), NA), fill = list(n = 0)) %>%
  mutate(perc_converged = n / sum(n)) %>%
  mutate(perc_converged = cumsum(perc_converged)) %>%
  ungroup() %>%
  arrange(condition, convergence_point) %>%
  fill(perc_converged)

```



```{r fig.width=12}
convergence_cumulative_combined %>%
  filter(!is.na(convergence_point)) %>%
  ggplot(aes(x = convergence_point, y = perc_converged, colour = condition, lty = condition)) +
  # facet_wrap(~ dataset) +
  geom_line(size = 1) +
  scale_colour_brewer(type = "qual", palette = 7) +
  scale_y_continuous(labels = scales::percent_format()) +
  expand_limits(x = 0, y = 0) +
  labs(x = "Repetition",
       y = "Estimates converged",
       colour = "Condition",
       lty = "Condition")
       # caption = "Convergence means that there are no more adjustments larger than 0.00496")

ggsave("../output/cum_perc_converged.pdf", device = "pdf", width = 5, height = 3)
```

Make the plot shown in the paper:
```{r}
convergence_cumulative_combined_tex <- convergence_cumulative_combined
levels(convergence_cumulative_combined_tex$condition)[5] <- "Fact \\& Learner"

tikz(file = "../output/cum_perc_converged.tex", width = 4.75, height = 2)

convergence_cumulative_combined_tex %>%
  filter(!is.na(convergence_point)) %>%
  ggplot(aes(x = convergence_point, y = perc_converged, colour = condition, lty = condition)) +
  geom_line(size = 1) +
  scale_colour_brewer(type = "qual", palette = 7) +
  scale_y_continuous(labels = scales::percent_format(suffix = "\\%")) +
  expand_limits(x = 0, y = 0) +
  labs(x = "Repetition",
       y = "Estimates converged",
       colour = "Condition",
       lty = "Condition") +
  theme_light()

dev.off()
```



What proportion of estimates converged in each condition?
```{r}
convergence_cumulative %>%
  group_by(dataset, condition) %>%
  filter(!is.na(convergence_point)) %>%
  summarise(n_converged = sum(n),
            perc_converged = max(perc_converged))
```



For the estimates that did converge, plot the distribution of convergence points by condition:
```{r}
convergence %>%
  filter(!is.na(convergence_point)) %>%
  ggplot(aes(x = condition, y = convergence_point)) +
  # facet_grid(dataset ~ ., labeller = label_wrap_gen()) +
  geom_jitter(aes(colour = condition), width = 0.1, height = 0, alpha = 0.1) +
  geom_violin(fill = NA) +
  expand_limits(y = 0) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Convergence point")

ggsave("../output/convergence_point.pdf", device = "pdf", width = 5, height = 3)

```

Make the plot shown in the paper:
```{r}
convergence_tex <- convergence
levels(convergence_tex$condition)[5] <- "Fact \\& Learner"

tikz(file = "../output/convergence_point.tex", width = 4.75, height = 2)

convergence_tex %>%
  filter(!is.na(convergence_point)) %>%
  ggplot(aes(x = condition, y = convergence_point)) +
  # facet_grid(dataset ~ ., labeller = label_wrap_gen()) +
  geom_jitter(aes(colour = condition), width = 0.1, height = 0, alpha = 0.1) +
  geom_violin(fill = NA) +
  expand_limits(y = 0) +
  scale_colour_brewer(type = "qual", palette = 7) +
  guides(colour = FALSE) +
  labs(x = NULL,
       y = "Convergence point") +
  theme_light()

dev.off()
```



As preregistered, we conduct a Bayesian ANOVA testing the effect of condition on the convergence point.

There is very strong evidence for this model compared to a null model.
```{r}
convergence_dat <- convergence %>%
  ungroup() %>%
  filter(!is.na(convergence_point)) %>%
  mutate_if(is.character, as.factor) %>%
  mutate(fact_id = as.factor(fact_id))

bf_convergence <- lmBF(
  formula = convergence_point ~ condition * dataset + subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = convergence_dat,
  progress = FALSE
)
  
bf_convergence  

1/bf_convergence
```
To check whether the two datasets should be analysed separately, compare the first model to one without the interaction between condition and dataset.
This comparison shows strong evidence for the model without the interaction over the model with it.
```{r}
bf_convergence_nointeraction <- lmBF(
  formula = convergence_point ~ condition + dataset + subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = convergence_dat,
  progress = FALSE
)
  
bf_convergence_nointeraction / bf_convergence
```

A model that leaves out dataset altogether is supported even more strongly by the evidence, showing that we can assume no effect of dataset on the convergence point.
```{r}
bf_convergence_nodataset <- lmBF(
  formula = convergence_point ~ condition + subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = convergence_dat,
  progress = FALSE
)

bf_convergence_nodataset / bf_convergence_nointeraction
```

Compare to the maximal model:
```{r}
bf_convergence_nodataset / bf_convergence
```
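As an aside, `BayesFactor` objects that share a denominator divide to give relative evidence, and these ratios chain multiplicatively, which is why the pairwise comparisons in this section are mutually consistent. A minimal sketch with made-up Bayes factors (not the values produced by this notebook):

```{r}
# Hypothetical Bayes factors of three models, each against a shared null model
# (made-up numbers for illustration only).
bf_a <- 120  # e.g. a full model vs the null
bf_b <- 40   # e.g. a no-interaction model vs the null
bf_c <- 2    # e.g. a further simplified model vs the null

# The comparison of model A to model C equals (A vs B) times (B vs C),
# so any chain of pairwise ratios leads to the same overall comparison.
(bf_a / bf_b) * (bf_b / bf_c) == bf_a / bf_c  # TRUE
```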


Compare the preferred simpler model to a model without condition effect:
```{r}
bf_convergence_nocondition <-  lmBF(
  formula = convergence_point ~ subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = convergence_dat,
  progress = FALSE
)

bf_convergence_nocondition / bf_convergence_nodataset 
```

```{r}
bf_convergence_nocondition / bf_convergence_nointeraction
bf_convergence_nocondition / bf_convergence
```


Although it is not the preferred model, we can sample from the posterior of the model *with* condition (but without dataset) to see what it predicts:
```{r}
samples <- posterior(bf_convergence_nodataset, iterations = 1000, progress = FALSE)
summary(samples[,1:6])
```




---

# Exploratory: the effect of the RoF estimate on convergence

## Convergence vs. no convergence

```{r}
convergence_traces <- convergence %>%
  mutate(converges = !is.na(convergence_point)) %>%
  left_join(alpha_change, by = c("dataset", "subject", "condition", "fact_id"))
```

### Trajectories
```{r}
ggplot(convergence_traces, aes(x = repetition, y = alpha, colour = interaction(subject, fact_id))) +
  facet_grid(converges ~ condition) +
  geom_line(alpha = 0.1) +
  geom_point(alpha = 0.1) +
  guides(colour = FALSE) +
  labs(title = "Trajectories that do not converge (top) and those that do (bottom)",
       y = "Rate of forgetting")

ggplot(convergence_traces, aes(x = repetition, y = d_alpha, colour = interaction(subject, fact_id))) +
  facet_grid(converges ~ condition) +
  geom_line(alpha = 0.1) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = c(-0.00496, 0.00496), lty = 2) +
  guides(colour = FALSE) +
  labs(title = "Trajectories that do not converge (top) and those that do (bottom)",
       y = "Change in rate of forgetting")

```


Indexed on the final value:
```{r}
convergence_traces %>%
  group_by(dataset, subject, condition, fact_id) %>%
  mutate(alpha_endaligned = alpha - final_alpha) %>%
  ggplot(aes(x = repetition, y = alpha_endaligned, colour = interaction(subject, fact_id))) +
  facet_grid(converges ~ condition) +
  geom_line(alpha = 0.1) +
  geom_point(alpha = 0.1) +
  guides(colour = FALSE) +
  labs(title = "Trajectories that do not converge (top) and those that do (bottom)",
       y = "Rate of forgetting (indexed on final value)")

```


### Convergence point ~ number of repetitions

The convergence point is by definition capped at the total number of repetitions a fact receives (at the latest, convergence can be observed at the final repetition).
It is also quite clear that convergence rarely happens before the final repetition, since converging earlier requires more than two consecutive small changes.
Pearson's r: `r cor(convergence$convergence_point, convergence$total_reps, use = "complete.obs")`.
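For reference, a minimal self-contained sketch of how a convergence point can be determined under the definition used here (no adjustments larger than 0.00496 from that repetition onward). The actual computation happens earlier in the notebook; `find_convergence_point` is a hypothetical helper, not a function defined elsewhere in this analysis.

```{r}
# Return the repetition from which all subsequent absolute changes in the
# rate-of-forgetting estimate stay at or below the threshold (0.00496),
# or NA if the final change is still above it (no convergence observed).
find_convergence_point <- function(abs_d_alpha, threshold = 0.00496) {
  above <- which(abs_d_alpha > threshold)
  if (length(above) == 0) return(1L)  # converged immediately
  last_large <- max(above)
  if (last_large == length(abs_d_alpha)) return(NA_integer_)  # never converges
  last_large + 1L
}

find_convergence_point(c(0.05, 0.02, 0.004, 0.001))  # 3: small from rep 3 on
find_convergence_point(c(0.05, 0.02, 0.004, 0.010))  # NA: final change too large
```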
```{r}
ggplot(filter(convergence, !is.na(convergence_point)), aes(x = total_reps, y = convergence_point)) +
  geom_jitter(alpha = 0.2) +
  coord_equal() +
  geom_abline(slope = 1, intercept = 0, lty = 2)
```



Do changes in the RoF estimate become smaller as the number of repetitions increases?
As the plot below shows: yes, but only up to a point. After about 7 repetitions the size of the change stays more or less constant.
```{r}
convergence_traces %>%
  group_by(repetition) %>%
  summarise(mean_change = mean(abs_d_alpha),
            se_change = plotrix::std.error(abs_d_alpha)) %>%
  ggplot(aes(x = repetition, y = mean_change)) +
  geom_point() +
  geom_errorbar(aes(ymin= mean_change - se_change, ymax = mean_change + se_change)) +
  labs(x = "Repetition",
       y = "Mean change in RoF estimate",
       caption = "Error bars show +/- 1 standard error of the mean")
```


### Convergence point ~ final RoF

The convergence point seems to depend on the final rate of forgetting: the higher the rate of forgetting, the later convergence happens.
Importantly, this trend seems to exist in all conditions.
```{r}
quantiles <- quantile(filter(distinct(convergence_traces, dataset, subject, condition, fact_id, converges, final_alpha), converges)$final_alpha,
                      probs = seq(0, 1, 0.2))

convergence_traces %>%
  filter(converges) %>%
  ungroup() %>%
  mutate(final_alpha_bin = cut(final_alpha, quantiles, include.lowest = TRUE)) %>%
  group_by(dataset, subject, condition, fact_id, final_alpha) %>%
  slice(n()) %>%
  ggplot(aes(x = final_alpha_bin, y = convergence_point)) +
  facet_grid(~ condition) +
  geom_jitter(aes(colour = final_alpha_bin), width = 0.2, alpha = 0.2) +
  geom_violin(fill = NA) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  guides(colour = FALSE) +
  labs(x = "Final rate of forgetting (binned in quantiles)",
       y = "Convergence point") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
   
```

```{r}
convergence_final_rof <- distinct(convergence_traces, dataset, subject, condition, fact_id, final_alpha, converges, convergence_point) %>%
  ungroup() %>%
  filter(converges) %>%
  mutate(dataset = as.factor(dataset),
         subject = as.factor(subject),
         fact_id = as.factor(fact_id))

m_conv_final_rof <- lmer(convergence_point ~ final_alpha * condition  + (1 | subject) + (1 | fact_id), data = convergence_final_rof)

summary(m_conv_final_rof)
```

The lmer confirms a large, positive main effect of the final rate of forgetting on the convergence point, but no effect of, or interaction with, condition.

This implies that, regardless of condition, convergence happens sooner when the final estimate is lower.


Compare the final-rate-of-forgetting distribution of estimates that did converge to that of estimates that did not.
```{r}
convergence_traces %>%
  distinct(dataset, subject, condition, fact_id, .keep_all = TRUE) %>%
  ggplot(aes(x = converges, y = final_alpha)) +
  facet_grid(~ condition) +
  geom_jitter(aes(colour = converges), width = 0.2, alpha = 0.2) +
  geom_violin(fill = NA) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  guides(colour = FALSE)

```


```{r}
conv <- convergence_traces %>%
  distinct(dataset, subject, condition, fact_id, .keep_all = TRUE)

m_conv_final <- lmer(converges ~ final_alpha * condition + (1 | subject) + (1 | fact_id), data = conv)
summary(m_conv_final)
```


Takeaway:

- Estimates that do converge tend to have a higher final value than estimates that don't (which makes some sense, since there are more opportunities for adjustment/stabilising).
- For estimates that converge, convergence happens sooner when the final value is lower.
- Tradeoff: lower final value means less likely to converge, but when convergence does happen, it's faster. 


### Convergence point ~ initial RoF

Is the same true for the initial rate of forgetting estimate?
Maybe, judging by the plot below (note that the quantiles are recalculated based on the initial alpha values).
```{r}
quantiles <- quantile(filter(convergence_traces, converges, repetition == 1)$alpha,
                      probs = seq(0, 1, 0.2))

convergence_traces %>%
  filter(converges, repetition == 1) %>%
  ungroup() %>%
  mutate(alpha_bin = cut(alpha, quantiles, include.lowest = TRUE)) %>%
  ggplot(aes(x = alpha_bin, y = convergence_point)) +
  facet_grid(~ condition) +
  geom_jitter(aes(colour = alpha_bin), width = 0.2, alpha = 0.2) +
  geom_violin(fill = NA) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  guides(colour = FALSE) +
  labs(x = "Initial rate of forgetting (binned in quantiles)",
       y = "Convergence point") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

```

```{r}
convergence_init_rof <- convergence_traces %>%
  ungroup() %>%
  filter(converges, repetition == 1) %>%
  mutate(dataset = as.factor(dataset),
         subject = as.factor(subject),
         fact_id = as.factor(fact_id))

m_conv_init_rof <- lmer(convergence_point ~ alpha * condition  + (1 | subject) + (1 | fact_id), data = convergence_init_rof)

summary(m_conv_init_rof)
```


The lmer confirms that the convergence point increases as a function of the initial rate of forgetting (`alpha`) in all conditions, even accounting for the interaction between the initial alpha and the condition.

Convergence seems to be about equally likely for low and high initial values.
```{r}
convergence_traces %>%
  distinct(dataset, subject, condition, fact_id, .keep_all = TRUE) %>%
  ggplot(aes(x = converges, y = alpha)) +
  facet_grid(~ condition) +
  geom_jitter(aes(colour = converges), width = 0.2, alpha = 0.2) +
  geom_violin(fill = NA) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  guides(colour = FALSE)

```


The lack of an effect is confirmed by an lmer:
```{r}
m_conv_init <- lmer(converges ~ alpha * condition + (1 | subject) + (1 | fact_id), data = conv)
summary(m_conv_init)
```



We can surmise that the convergence point is sensitive to both the initial and the final estimate.
When conditions have unequal initial estimates, as is the case here (initial estimates in the adaptive conditions tend to be higher than in the default condition), this confounds the results.

It is clear that the operationalisation of the convergence point is problematic: it would lead to the conclusion that a lower initial estimate is always better, simply because the current analysis rewards lower estimates.
We may want to think about a measure that is robust to this.

Ideas:

- Split the data on initial estimate (lower than 0.3 or higher than 0.3) and see if convergence improves below 0.3.
- ...
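A minimal sketch of the first idea, using toy data rather than the real convergence table (the made-up column names mirror those used above):

```{r}
# Toy data standing in for the real convergence table (made-up values).
toy <- data.frame(
  alpha             = c(0.20, 0.25, 0.35, 0.40),  # initial estimates
  convergence_point = c(6, 7, 10, 12)
)

# Split on whether the initial estimate is below the default of 0.3,
# then compare mean convergence points between the two halves.
toy$init_side <- ifelse(toy$alpha < 0.3, "below 0.3", "at/above 0.3")
aggregate(convergence_point ~ init_side, data = toy, FUN = mean)
```

On the real data, the same split would of course be applied per condition as well.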



The lower convergence success in the adaptive conditions may be due to the final alpha being higher in those conditions.
The final rate of forgetting appears to be slightly lower in the default condition than in the adaptive conditions, but this difference mostly disappears in pairwise comparisons (also when looking only at converged estimates), and it is not supported by a Bayesian mixed-effects model (`lmBF`), which slightly prefers a model without condition.
```{r}
p_final_rof <- ggplot(conv, aes(x = condition, y = final_alpha)) +
  geom_violin() +
  geom_jitter(alpha = 0.1, aes(colour = condition)) +
  geom_boxplot(outlier.shape = NA, width = 0.5) +
  labs(x = "Condition",
       y = "Final rate of forgetting") +
  guides(colour = FALSE)

tikz(file = "../output/final_rof_by_cond.tex", width = 4, height = 2.5)
p_final_rof +
  scale_x_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()

p_final_rof
```

```{r}
summary(m_final_rof_cond <- lmer(final_alpha ~ condition + (1 | subject) + (1 | fact_id), data = conv))

bf_final_rof_cond <- lmBF(
  formula = final_alpha ~ condition + subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = mutate(ungroup(conv), subject = as.factor(subject), fact_id = as.factor(fact_id)),
  progress = FALSE
)
  
bf_final_rof_nocond <- lmBF(
  formula = final_alpha ~ subject + fact_id,
  whichRandom = c("subject", "fact_id"),
  data = mutate(ungroup(conv), subject = as.factor(subject), fact_id = as.factor(fact_id)),
  progress = FALSE
)
  
bf_final_rof_cond / bf_final_rof_nocond 

```

Tangential: how similar are the final rates of forgetting for specific facts across conditions?
They look fairly consistent:
```{r}
conv %>%
  group_by(condition, fact_id) %>%
  summarise(alpha = mean(final_alpha),
            n = n()) %>%
  # mutate(rank = dense_rank(desc(alpha))) %>%
  ggplot(aes(x = condition, y = alpha, group = fact_id, colour = as.factor(fact_id))) +
  geom_line(alpha = 0.75) +
  geom_point(aes(size = n)) +
  guides(colour = FALSE)
```


---

Make diagnostic plots:


```{r}
library(patchwork)

# Change in rate of forgetting with each repetition
p_rof_change <- convergence_traces %>%
  group_by(repetition) %>%
  summarise(mean_change = mean(abs_d_alpha),
            se_change = plotrix::std.error(abs_d_alpha)) %>%
  ggplot(aes(x = repetition, y = mean_change)) +
  geom_point() +
  geom_errorbar(aes(ymin= mean_change - se_change, ymax = mean_change + se_change), na.rm = T) +
  geom_hline(yintercept = 0.00496, colour = "red", lty = 2) +
  labs(x = "Repetition",
       y = "Absolute rate of forgetting change")

  
# Convergence point ~ total number of repetitions
p_conv_reps <- ggplot(filter(convergence, !is.na(convergence_point)), aes(x = total_reps, y = convergence_point)) +
  geom_jitter(alpha = 0.2) +
  coord_equal() +
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  labs(x = "Total repetitions",
      y = "Convergence point")


p_conv_point_initial <- ggplot(conv, aes(x = alpha, y = convergence_point, colour = condition, fill = condition)) +
  facet_grid(~ dataset) +
  geom_point(alpha = 0.1, na.rm = T) +
  geom_smooth(method = "lm", se = T, na.rm = T, size = .75) +
  # geom_rug(aes(y = NULL), sides = "b") +
  labs(x = "Initial rate of forgetting",
         y = "Convergence point")

p_conv_point_final <- ggplot(conv, aes(x = final_alpha, y = convergence_point, colour = condition, fill = condition)) +
  facet_grid(~ dataset) +
  geom_point(alpha = 0.1, na.rm = T) +
  geom_smooth(method = "lm", se = T, na.rm = T, size = .75) +
  # geom_rug(aes(y = NULL), sides = "b") +
  labs(x = "Final rate of forgetting",
         y = "Convergence point")

p_conv_prob_initial <- ggplot(conv, aes(x = alpha, y = as.numeric(converges), colour = condition, fill = condition)) +
  facet_grid(~ dataset) +
  geom_point(alpha = 0.1, na.rm = T) +
  geom_smooth(method = "glm",  method.args=list(family="binomial"), se = T, na.rm = T) +
  # geom_rug(aes(y = NULL), sides = "b") +
  ylim(0, 1) +
  labs(x = "Initial rate of forgetting",
       y = "Convergence probability")

p_conv_prob_final <- ggplot(conv, aes(x = final_alpha, y = as.numeric(converges), colour = condition, fill = condition)) +
  facet_grid(~ dataset) +
  geom_point(alpha = 0.1, na.rm = T) +
  geom_smooth(method = "glm",  method.args=list(family="binomial"), se = T, na.rm = T) +
  # geom_rug(aes(y = NULL), sides = "b") +
  ylim(0, 1) +
  labs(x = "Final rate of forgetting",
       y = "Convergence probability")


p_rof_change + p_conv_reps + p_conv_point_initial + p_conv_prob_initial + p_conv_point_final + p_conv_prob_final + plot_layout(ncol = 2, byrow = TRUE)

ggsave("../output/convergence_diagnostics.pdf", width = 12, height = 12)

```

Plot the RoF change plot for each condition separately:
```{r}
p_rof_change_cond <- convergence_traces %>%
  group_by(condition, repetition) %>%
  summarise(mean_change = mean(abs_d_alpha),
            se_change = plotrix::std.error(abs_d_alpha)) %>%
  ggplot(aes(x = repetition, y = mean_change, colour = condition)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin= mean_change - se_change, ymax = mean_change + se_change), na.rm = T) +
  geom_hline(yintercept = 0.00496, colour = "red", lty = 2) +
  labs(x = "Repetition",
       y = "Absolute rate of forgetting change") +
  xlim(0,10) 

p_rof_change_cond
```


Also plot each diagnostic plot separately:
```{r}
tikz(file = "../output/rof_change.tex", width = 4, height = 2.5)
p_rof_change +
  theme_light()
dev.off()

tikz(file = "../output/conv_reps.tex", width = 2.5, height = 2.5)
p_conv_reps +
  theme_light()
dev.off()

tikz(file = "../output/conv_point_initial_rof.tex", width = 4, height = 2)
p_conv_point_initial +
  scale_color_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  scale_fill_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()

tikz(file = "../output/conv_point_final_rof.tex", width = 4, height = 2)
p_conv_point_final +
  scale_color_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  scale_fill_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()

tikz(file = "../output/conv_prob_initial_rof.tex", width = 4, height = 2)
p_conv_prob_initial +
  scale_color_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  scale_fill_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()

tikz(file = "../output/conv_prob_final_rof.tex", width = 4, height = 2)
p_conv_prob_final +
  scale_color_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  scale_fill_discrete(labels = c("Default", "Domain", "Fact", "Learner", "Fact \\& Learner")) +
  theme_light()
dev.off()

```




---

# Session information
```{r}
sessionInfo()
```

